Part 2. Summarizing distributions (part b)
Suppose we have two RVs \(X\) and \(Y\).
We know the joint PMF/PDF \(f(x, y)\) and joint CDF \(F(x, y)\).
How can we summarize the relationship between \(X\) and \(Y\)?
\[\text{Cov}[X, Y] = {\textrm E}\left[ (X - {\textrm E}[X])(Y - {\textrm E}[Y]) \right]\]
Intuitively, “Does \(X\) tend to be above \({\textrm E}[X]\) when \(Y\) is above \({\textrm E}[Y]\)? (And by how much?)”
\[ f(x,y) = \begin{cases} 1/3 & x = 0, y = 0 \\ 1/6 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]
What is \({\textrm E}[X]\)? What is \({\textrm E}[Y]\)?
Then compute expectation of \((X - {\textrm E}[X])(Y - {\textrm E}[Y])\) (function of two RVs) as above.
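A quick sanity check of this computation, using Python's exact `fractions` arithmetic on the joint PMF above:

```python
from fractions import Fraction as F

# Joint PMF from the example above: f(x, y)
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

EX = sum(p * x for (x, y), p in f.items())   # E[X] = 2/3
EY = sum(p * y for (x, y), p in f.items())   # E[Y] = 1/2

# Covariance via the definition: E[(X - E[X])(Y - E[Y])]
cov = sum(p * (x - EX) * (y - EY) for (x, y), p in f.items())
print(EX, EY, cov)  # 2/3 1/2 1/6
```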
Compare:
\[\begin{align}\text{Cov}[X, Y] &= {\textrm E}\left[ \color{blue}{(X - {\textrm E}[X])}\color{orange}{(Y - {\textrm E}[Y])} \right] \\ {\textrm V}[X] &= {\textrm E}\left[ \color{blue}{(X - {\textrm E}[X])}\color{blue}{(X - {\textrm E}[X])} \right]\end{align}\]
Plot the points in \(\text{Supp}[X, Y]\) on two axes with point size proportional to \(f(x, y)\).
Divide the \(x, y\) plane into quadrants defined by \(x = {\textrm E}[X]\) and \(y = {\textrm E}[Y]\).
For each point \((x, y) \in \text{Supp}[X, Y]\), create a rectangle with \((x,y)\) at one corner and \(({\textrm E}[X], {\textrm E}[Y])\) at the opposite corner.
Shade the rectangle green in quadrants I and III (where \((x - {\textrm E}[X])(y - {\textrm E}[Y]) > 0\)), otherwise red, with intensity proportional to \(f(x,y)\).
Covariance (roughly) measures how much green vs red there is.
First formulation:
\[\text{Cov}[X, Y] = {\textrm E}\left[ (X - {\textrm E}[X])(Y - {\textrm E}[Y]) \right]\]
As with variance, an alternative formulation:
\[\text{Cov}[X, Y] = {\textrm E}\left[XY\right] - {\textrm E}[X]{\textrm E}[Y]\]
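The two formulations agree; a sketch checking this on the joint PMF example above:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

EX = sum(p * x for (x, y), p in f.items())
EY = sum(p * y for (x, y), p in f.items())
EXY = sum(p * x * y for (x, y), p in f.items())

cov_def = sum(p * (x - EX) * (y - EY) for (x, y), p in f.items())
cov_short = EXY - EX * EY  # alternative formulation

assert cov_def == cov_short == F(1, 6)
```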
Note (referring to the plots):
Positive-covariance case: (1, 1) and (3, 3) equally likely.
Negative-covariance case: (1, 3) and (3, 1) equally likely.
\({\textrm E}[X]{\textrm E}[Y]\) is the area below and to the left of the dashed lines at \({\textrm E}[X]\) and \({\textrm E}[Y]\).
\({\textrm E}[XY]\) is the average of the two rectangle areas \(xy\) (given equal probability).
If \(g\) is a linear map (also called a linear function or linear operator), then \(g(x + y) = g(x) + g(y)\). (The additivity property.) Examples?
Recall linearity of expectations: \({\textrm E}[X + Y] = {\textrm E}[X] + {\textrm E}[Y]\).
But in general \(\text{Var}[X + Y] \neq \text{Var}[X] + \text{Var}[Y]\)
Why not?
\[\begin{aligned} \text{Var}(X+Y) &= {\textrm E}[(X + Y - {\textrm E}[X + Y])^2] \\ &= {\textrm E}[(X - {\textrm E}[X] + Y - {\textrm E}[Y])^2] \\ &= {\textrm E}[(w + z)^2] \qquad \text{(letting } w = X - {\textrm E}[X],\; z = Y - {\textrm E}[Y]\text{)} \\ &= {\textrm E}[w^2 + z^2 + 2 w z] \\ &= {\textrm E}[w^2] + {\textrm E}[z^2] + {\textrm E}[2 w z] \\ &= {\textrm E}[(X - {\textrm E}[X])^2] + {\textrm E}[(Y - {\textrm E}[Y])^2] + 2{\textrm E}[(X - {\textrm E}[X])(Y - {\textrm E}[Y])] \\ &= \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y) \end{aligned}\]
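The identity on the last line can be verified exactly on the joint PMF example from earlier:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

def E(g):
    """Expectation of g(x, y) under the joint PMF."""
    return sum(p * g(x, y) for (x, y), p in f.items())

VX = E(lambda x, y: x**2) - E(lambda x, y: x)**2                 # V[X] = 2/9
VY = E(lambda x, y: y**2) - E(lambda x, y: y)**2                 # V[Y] = 1/4
cov = E(lambda x, y: x*y) - E(lambda x, y: x) * E(lambda x, y: y)  # 1/6
V_sum = E(lambda x, y: (x + y)**2) - E(lambda x, y: x + y)**2    # V[X+Y]

assert V_sum == VX + VY + 2 * cov  # 29/36 on both sides
```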
The correlation of two RVs \(X\) and \(Y\) with \(\sigma[X] > 0\) and \(\sigma[Y] > 0\) is
\[ \rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]}\]
Correlation is scale-invariant: \(\rho[X, Y] = \rho[aX, bY]\) for \(a, b > 0\)
Prove it!
\[\begin{align} \text{Cov}[aX, bY] &= {\textrm E}[aX bY] - {\textrm E}[aX]{\textrm E}[bY] \\ &= ab {\textrm E}[XY] - ab {\textrm E}[X]{\textrm E}[Y] \\ &= ab ({\textrm E}[XY] - {\textrm E}[X]{\textrm E}[Y]) \\ &= ab \text{Cov}[X, Y] \end{align}\]
\[\sigma[aX] = \sqrt{{\textrm V}[aX]} = \sqrt{a^2 {\textrm V}[X]} = |a|\,\sigma[X] = a\,\sigma[X] \quad (\text{since } a > 0)\]
By same argument, \(\sigma[bY] = b\sigma[Y]\).
So
\[\begin{align} \rho[aX, bY] &= \frac{\text{Cov}[aX, bY]}{\sigma[aX] \sigma[bY]} \\ &= \frac{ab \text{Cov}[X, Y]}{a \sigma[X] b \sigma[Y]} = \frac{\text{Cov}[X, Y]}{\sigma[X] \sigma[Y]} \\ &= \rho[X, Y] \end{align}\]
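A numeric check of scale invariance on the earlier joint PMF (floats here, since correlation involves square roots); the scale factors 2 and 5 are arbitrary choices for illustration:

```python
import math

# Joint PMF from the earlier example (as floats)
f = {(0, 0): 1/3, (1, 0): 1/6, (1, 1): 1/2}

def E(g):
    return sum(p * g(x, y) for (x, y), p in f.items())

def corr(a, b):
    """Correlation of (aX, bY) under the joint PMF."""
    cov = E(lambda x, y: a*x * b*y) - E(lambda x, y: a*x) * E(lambda x, y: b*y)
    sx = math.sqrt(E(lambda x, y: (a*x)**2) - E(lambda x, y: a*x)**2)
    sy = math.sqrt(E(lambda x, y: (b*y)**2) - E(lambda x, y: b*y)**2)
    return cov / (sx * sy)

assert abs(corr(1, 1) - corr(2.0, 5.0)) < 1e-12  # scale-invariant
```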
We spent time on expectations:
\[{\textrm E}[Y] = \sum_y y f(y).\]
Also on conditional distributions:
\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}\]
Combining the two ideas, we get conditional expectations:
\[{\textrm E}[Y \mid X = x] = \sum_y y f_{Y|X}(y \mid x).\]
i.e. the expectation of \(Y\) at some \(x\).
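Applied to the joint PMF example from earlier, a short Python sketch:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

def E_Y_given(x):
    """E[Y | X = x]: average y over the conditional PMF f(y|x)."""
    fx = sum(p for (xv, y), p in f.items() if xv == x)          # marginal f_X(x)
    return sum(p * y for (xv, y), p in f.items() if xv == x) / fx

assert E_Y_given(0) == 0        # given X = 0, Y is always 0
assert E_Y_given(1) == F(3, 4)  # (1/2) / (2/3) = 3/4
```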
(Red line represents \({\textrm E}[Y | X = x]\), dots a sample from \(f(x, y)\))
Two formulations:
\[{\textrm V}[Y | X = x] = {\textrm E}[(Y - {\textrm E}[Y | X =x])^2 | X = x]\]
\[{\textrm V}[Y | X = x] = {\textrm E}[Y^2 | X = x] - {\textrm E}[Y | X =x]^2\]
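Using the second (shortcut) formulation on the earlier joint PMF example:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

def cond_E(g, x):
    """E[g(Y) | X = x] under the joint PMF."""
    fx = sum(p for (xv, y), p in f.items() if xv == x)
    return sum(p * g(y) for (xv, y), p in f.items() if xv == x) / fx

# V[Y|X=1] = E[Y^2|X=1] - E[Y|X=1]^2 = 3/4 - 9/16 = 3/16
V1 = cond_E(lambda y: y**2, 1) - cond_E(lambda y: y, 1)**2
assert V1 == F(3, 16)
```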
Conditional expectation \({\textrm E}[Y | X = x]\) is for a specific \(x\).
Conditional expectation function (CEF) \({\textrm E}[Y | X]\) is for all \(x\).
The CEF \({\textrm E}[Y | X]\) is the expectation of \(Y\) at each \(X\).
We already established that the expectation/mean is the best (in the MSE sense) constant predictor of a random variable.
So CEF is the best possible way to use \(X\) to predict \(Y\). (See Theorem 2.2.20.)
Multivariate generalization: \({\textrm E}[Y \mid X_1, X_2, X_3, \ldots, X_n]\) is the best way to use \(X_1, \ldots, X_n\) to predict \(Y\).
For random variables \(X\) and \(Y\),
\[{\textrm E}[Y] = {\textrm E}[{\textrm E}[Y | X]]\]
This means there are two ways to get \({\textrm E}[Y]\): average \(Y\) directly over its marginal distribution, or first compute \({\textrm E}[Y \mid X = x]\) for each \(x\) and then average those over the distribution of \(X\).
In words: An unconditional average (\({\textrm E}[Y]\)) can be represented as a weighted average of conditional expectations (\({\textrm E}[Y \mid X]\)) with weights taken from the distribution of the variable conditioned on, i.e. \(X\).
Why would you want to do that?
A population is 80% female and 20% male.
The average age among females (\({\textrm E}[Y | X = 1]\)) is 25. The average age among males (\({\textrm E}[Y | X = 0]\)) is 20.
What is the average age in the population, \({\textrm E}[Y]\)?
\[{\textrm E}[{\textrm E}[Y | X]] = .8 \times 25 + .2 \times 20 = 24\]
See homework for another example.
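The LIE can also be verified exactly on the joint PMF example from earlier:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example, and the marginal of X
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}
fX = {0: F(1, 3), 1: F(2, 3)}

def E_Y_given(x):
    return sum(p * y for (xv, y), p in f.items() if xv == x) / fX[x]

EY_direct = sum(p * y for (x, y), p in f.items())     # E[Y] from the joint
EY_via_lie = sum(fX[x] * E_Y_given(x) for x in fX)    # E[E[Y|X]]

assert EY_direct == EY_via_lie == F(1, 2)
```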
Suppose we want to measure the average effect of participating in a program (e.g. job training, voter education, military mobilization).
Call \(Y\) the (unobservable) effect of the treatment. We want the average treatment effect (ATE), \({\textrm E}[Y]\).
Suppose that comparing participants and non-participants gives us a good estimate of the average treatment effect only within subgroups defined by age (\(X\)).
So we have \({\textrm E}[Y \mid X]\).
Now we just combine these estimates (by LIE): \({\textrm E}[Y] = {\textrm E}[{\textrm E}[Y \mid X]] = \sum_{x} {\textrm E}[Y \mid X = x] f(x)\)
\[{\textrm V}[Y] = {\textrm E}[{\textrm V}[Y|X]] + {\textrm V}[{\textrm E}[Y|X]]\]
In words, the variance of \(Y\) can be decomposed into the expected conditional variance (\({\textrm E}[{\textrm V}[Y|X]]\)) and the variance of the conditional expectation (\({\textrm V}[{\textrm E}[Y|X]]\)).
Sometimes called “Ev(v)e’s law” because
\[{\textrm V}[Y] = \color{red}{{\textrm E}}[\color{red}{{\textrm V}}[Y|X]] + \color{red}{{\textrm V}}[\color{red}{{\textrm E}}[Y|X]]\]
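A check of the decomposition on the earlier joint PMF example:

```python
from fractions import Fraction as F

# Joint PMF from the earlier example, and the marginal of X
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}
fX = {0: F(1, 3), 1: F(2, 3)}

def cond_E(g, x):
    """E[g(Y) | X = x] under the joint PMF."""
    return sum(p * g(y) for (xv, y), p in f.items() if xv == x) / fX[x]

cond_mean = {x: cond_E(lambda y: y, x) for x in fX}       # E[Y|X=x]
cond_var = {x: cond_E(lambda y: y**2, x) - cond_mean[x]**2 for x in fX}

E_of_V = sum(fX[x] * cond_var[x] for x in fX)             # E[V[Y|X]]
mean_of_means = sum(fX[x] * cond_mean[x] for x in fX)     # = E[Y]
V_of_E = sum(fX[x] * (cond_mean[x] - mean_of_means)**2 for x in fX)

VY = (sum(p * y**2 for (x, y), p in f.items())
      - sum(p * y for (x, y), p in f.items())**2)         # V[Y]

assert VY == E_of_V + V_of_E == F(1, 4)  # 1/4 = 1/8 + 1/8
```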
Suppose we want to predict \(Y\) using \(X\), and we focus on a linear predictor, i.e. a function of the form \(\alpha + \beta X\).
The best (minimum MSE) predictor satisfies
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}\,[\left(Y - (a + bX)\right)^2]\]
The solution (see Theorem 2.2.21) is
\[\beta = \frac{\text{Cov}[X, Y]}{{\textrm V}[X]}, \qquad \alpha = {\textrm E}[Y] - \beta\, {\textrm E}[X]\]
So we could obtain the BLP from a joint PMF. (See homework.)
Above, we were looking for best linear predictor (BLP) of \(Y\) as function of \(X\):
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}[\left(Y - (a + bX)\right)^2]\]
Same answer if you look for the best linear predictor of the CEF \({\textrm E}[Y \mid X]\):
\[(\alpha, \beta) = \underset{(a,b) \in \mathbb{R}^2}{\arg\min} \, {\textrm E}[\left({\textrm E}[Y|X] - (a + bX)\right)^2]\]
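A sketch computing the BLP coefficients from the earlier joint PMF, using the standard solution \(\beta = \text{Cov}[X, Y]/{\textrm V}[X]\), \(\alpha = {\textrm E}[Y] - \beta {\textrm E}[X]\):

```python
from fractions import Fraction as F

# Joint PMF from the earlier example
f = {(0, 0): F(1, 3), (1, 0): F(1, 6), (1, 1): F(1, 2)}

def E(g):
    return sum(p * g(x, y) for (x, y), p in f.items())

cov = E(lambda x, y: x*y) - E(lambda x, y: x) * E(lambda x, y: y)  # 1/6
VX = E(lambda x, y: x**2) - E(lambda x, y: x)**2                   # 2/9
beta = cov / VX
alpha = E(lambda x, y: y) - beta * E(lambda x, y: x)

assert beta == F(3, 4) and alpha == 0  # BLP is (3/4) x
```

Here the BLP coincides with the CEF (\({\textrm E}[Y \mid X = x] = (3/4)x\) on \(x \in \{0, 1\}\)), as must happen whenever the CEF is itself linear in \(x\).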